Abstract

We introduce a new framework for the convergence analysis of a class of distributed constrained 
non-convex optimization algorithms in multi-agent systems. The aim is to search for local minimizers 
of a non-convex objective function which is supposed to be a sum of local utility functions of the 
agents. The algorithm under study consists of two steps: a local stochastic gradient descent at each 
agent and a gossip step that drives the network of agents to a consensus. Under the assumption of 
decreasing stepsize, it is proved that consensus is asymptotically achieved in the network and that the 
algorithm converges to the set of Karush-Kuhn-Tucker points. As an important feature, the algorithm 
does not require the double-stochasticity of the gossip matrices. It is in particular suitable for use in a 
natural broadcast scenario for which no feedback messages between agents are required. It is proved 
that our result also holds if the number of communications in the network per unit of time vanishes
at moderate speed as time increases, allowing potential savings of the network's energy. Applications
to power allocation in wireless ad-hoc networks are discussed. Finally, we provide numerical results
which sustain our claims.



I. Introduction



Stochastic gradient descent is a widely used procedure for finding critical points of an unknown
function $f$ [32]. Formally, it can be summarized as an iterative scheme of the form
$\theta_{n+1} = \theta_n + \gamma_{n+1}(-\nabla f(\theta_n) + \xi_{n+1})$, where $\nabla$ is the gradient operator and where $\xi_{n+1}$ represents a
random perturbation. A suitable choice of the step size $\gamma_n$ ensures that, for a well-behaved
function $f$, the sequence $(\theta_n)_{n \in \mathbb{N}}$ eventually converges to a critical point.
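As a minimal sketch of the scheme above, the following code runs the recursion on a scalar toy problem. The step size $\gamma_{n+1} = \gamma_0/(n+1)$, the Gaussian perturbation and the quadratic test function are assumptions made for illustration only; they are not taken from the paper.

```python
import random


def sgd(grad, theta0, n_steps, gamma0=1.0, noise_std=0.1, seed=0):
    """Iterate theta_{n+1} = theta_n + gamma_{n+1} * (-grad(theta_n) + xi_{n+1})."""
    rng = random.Random(seed)
    theta = theta0
    for n in range(n_steps):
        gamma = gamma0 / (n + 1)        # assumed decreasing step size gamma_{n+1}
        xi = rng.gauss(0.0, noise_std)  # assumed zero-mean Gaussian perturbation
        theta = theta + gamma * (-grad(theta) + xi)
    return theta


# Hypothetical test function f(theta) = (theta - 2)^2 / 2; its only critical point is 2.
theta_final = sgd(lambda t: t - 2.0, theta0=0.0, n_steps=20000)
```

With these choices the iterate equals 2 plus the running average of the perturbations, so it ends up close to the critical point.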
In this paper, we investigate a distributed optimization problem which is of practical interest
in many multi-agent contexts such as parallel computing [[HI, statistical estimation [34], [33], [[U,
[28], robotics [12] or wireless networks [29]. Consider a network of $N$ agents. To each agent
$i = 1, \dots, N$, we associate a possibly non-convex continuously differentiable utility function
$f_i : \mathbb{R}^d \to \mathbb{R}$ where $d \in \mathbb{N}$. Let $G \subset \mathbb{R}^d$ be a nonempty compact convex subset. We address
the following optimization problem:

$$\min_{\theta \in G} \sum_{i=1}^{N} f_i(\theta).$$ (1)

The set $G$ is assumed to be known by all agents. However, a given agent $i$ does not know the utility
functions $f_j$ of other agents $j \neq i$. Cooperation between agents is therefore needed to find
minimizers of (1). Moreover, any utility function $f_i$ may be imperfectly observed by agent $i$
itself, due to the presence of random observation noise. We thus address the framework of
distributed stochastic approximation.

The literature contains at least two different cooperation approaches for solving (1). The so-called
incremental approach is used by [26], [23], [27], [30]: a message containing an estimate
of the desired minimizer iteratively travels all over the network. At any instant, the agent which
is in possession of the message updates its own estimate and adds its own contribution, based
on its local observation. Incremental algorithms generally require the message to go through a
Hamiltonian cycle in the network. Finding such a path is known to be an NP-complete problem
and is not particularly suitable to distributed computations. Relaxations of the Hamiltonian cycle
requirement have been proposed: for instance, [23] only requires that an agent communicates with
another agent randomly selected in the network (not necessarily in its neighborhood) according
to the uniform distribution. However, substantial routing is still needed. In [19], problem (1) is
solved using a different approach, assuming that agents perfectly observe their utility functions
and also know the utility functions of their neighbors.

This paper focuses on another cooperation approach based on average consensus techniques. 
In this context, each agent maintains its own estimate. Agents separately run local gradient 
algorithms and simultaneously communicate in order to eventually reach an agreement over 
the whole network on the value of the minimizer. Communicating agents combine their local 
estimates in a linear fashion: a receiver computes a weighted average between its own estimate 
and the ones which have been transmitted by its neighbors. Such combining techniques are often
referred to as gossip methods.

The idea behind the algorithm of interest in this paper is not new. Its roots can be found in
[38], [39] where a network of processors seeks to optimize some objective function known by all
agents (possibly up to some additive noise). More recently, numerous works extended this kind
of algorithm to more involved multi-agent scenarios, see [24], [25], [31], [22], [16], [36] as a
non-exhaustive list. Multi-agent systems are indeed more difficult to deal with, because individual
agents do not know the global objective function to be minimized. [24] addresses the problem of
unconstrained optimization, assuming convex but not necessarily differentiable utility functions.
Convergence to a global minimizer is established assuming that utility functions have bounded
(sub)gradients. Let us also mention [36] which focuses on the case of quadratic objective
functions. Unconstrained optimization is also investigated in [9] assuming differentiable but
not necessarily convex utility functions and relaxing boundedness conditions on the gradients.
Convergence to a critical point of the objective function is proved and the asymptotic performance
is evaluated in the form of a central limit theorem. In [25], the problem of constrained
distributed optimization is addressed. Convergence to an optimal consensus is proved when each
utility function $f_i$ is assumed convex and perfectly known by agent $i$. These results are extended
in [31] to the stochastic descent case i.e., when the observation of utility functions is perturbed
by a random noise.

In each of these works, the gossip communication scheme can be represented by a sequence
of matrices $(W_n)_{n \ge 1}$ of size $N \times N$, where the $(i,j)$th component of $W_n$ is the weight given by
agent $i$ to the message received from $j$ at time $n$, and is equal to zero in case agent $i$ receives no
message from $j$. In most works (see for instance [24], [25], [31], flU), matrices $W_n$ are assumed
doubly stochastic, meaning that $W_n^T \mathbf{1} = W_n \mathbf{1} = \mathbf{1}$ where $\mathbf{1}$ is the $N \times 1$ vector whose components
are all equal to one and where $^T$ denotes transposition. Although row-stochasticity ($W_n \mathbf{1} = \mathbf{1}$)
is rather easy to ensure in practice, column-stochasticity ($W_n^T \mathbf{1} = \mathbf{1}$) implies more stringent
restrictions on the communication protocol. For instance, in [11], each one-way transmission
from an agent $i$ to another agent $j$ requires at the same time a feedback link from $j$ to $i$.
Double stochasticity rules out natural broadcast schemes, in which a given agent may
transmit its local estimate to all its neighbors without expecting any immediate feedback [3]. Very
recently, [22] made a major step forward, getting rid of the column stochasticity condition, and
thus opening the road to a novel broadcast based constrained distributed optimization algorithm.



It is worth noting however that the algorithm of [22] is such that only receiving agents update
their estimates. Otherwise stated, an agent discards its local observations as long as it is not
the recipient of a message. Moreover, except perhaps in some special network topologies, the
algorithm of [22] strongly relies on a specific choice of the stepsize. In particular, a necessary
condition for the convergence to the desired consensus is that the stepsize vanishes at speed
$1/n$. However, in practice, it is often desirable to have some leeway in the choice of the stepsize
to avoid slow convergence issues.

Contributions 

In this paper, we address the optimization problem (1) using a distributed projected stochastic
gradient algorithm involving random gossip between agents and decreasing stepsize.

• Unlike previous works, utility functions are allowed to be non-convex. We introduce a new
framework for the analysis of a general class of distributed optimization algorithms, which
does not rely on convexity properties of the utility functions. Instead, our approach relies
on recent results of [7] about perturbed differential inclusions. Under a set of assumptions
made clear in the next section, we establish that, almost surely, the sequence of estimates
of any agent shadows the behavior of a differential variational inequality, and eventually
converges to the set of Karush-Kuhn-Tucker (KKT) points of (1).

• Our assumptions encompass the case of non-doubly stochastic gossip matrices $W_n$ and, as
a particular case, the natural broadcast gossip scheme of [3]. Our proofs reveal that, loosely
speaking, the relaxation of column stochasticity brings a "noise-like" term into the algorithm
dynamics, which is however not powerful enough to prevent convergence to the KKT points.

• We show that our convergence result still holds in case the number of communications in 
the network per unit of time vanishes at moderate speed as time increases. 

As an illustration, we apply our results to the problem of power allocation in the wireless 
interference channel. 

The paper is organized as follows. Section II introduces the distributed algorithm and the main
assumptions on the network and the observation model. The main result is stated in Section III.
Section IV is devoted to the proof. We discuss applications to power allocation in Section V.
Section VI describes some standard gossip schemes in more detail, and provides numerical
results.



II. The Distributed Algorithm 

A. Description of the Algorithm 

Each node $i$ generates a stochastic process $(\theta_{n,i})_{n \ge 1}$ in $\mathbb{R}^d$ using a two-step iterative algorithm:

[Local step] Node $i$ generates at time $n$ a temporary estimate $\tilde\theta_{n,i}$ given by

$$\tilde\theta_{n,i} = P_G[\theta_{n-1,i} + \gamma_n Y_{n,i}],$$ (2)

where $\gamma_n$ is a deterministic positive step size, $Y_{n,i}$ is a random variable, and $P_G$ represents the
projection operator onto the set $G$. Random variable $Y_{n,i}$ is to be interpreted as a perturbed
version of the opposite gradient of $f_i$ at point $\theta_{n-1,i}$. As will be made clear by Assumption 1e)
below, it is convenient to think of $Y_{n,i}$ as $Y_{n,i} = -\nabla f_i(\theta_{n-1,i}) + \delta M_{n,i}$ where $\delta M_{n,i}$ is a martingale
difference noise which stands for the random perturbation.

[Gossip step] Node $i$ is able to observe the values $\tilde\theta_{n,j}$ of some other $j$'s and computes
the weighted average:

$$\theta_{n,i} = \sum_{j=1}^{N} W_n(i,j)\, \tilde\theta_{n,j}$$

where for any $i$, $\sum_{j=1}^{N} W_n(i,j) = 1$. In the sequel, we define the $N \times N$ matrix
$W_n := [W_n(i,j)]_{i,j=1 \cdots N}$.

Define the random vectors $\theta_n$ and $Y_n$ as $\theta_n := (\theta_{n,1}^T, \dots, \theta_{n,N}^T)^T$ and $Y_n := (Y_{n,1}^T, \dots, Y_{n,N}^T)^T$.
The algorithm reduces to:

$$\theta_n = (W_n \otimes I_d)\, P_{G^N}[\theta_{n-1} + \gamma_n Y_n]$$ (3)

where $\otimes$ denotes the Kronecker product, $I_d$ is the $d \times d$ identity matrix and $P_{G^N}$ is the projector
onto the $N$th order product set $G^N := G \times \cdots \times G$.
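The two steps can be sketched as follows for scalar estimates ($d = 1$). Everything specific in this sketch is an assumption made for illustration: the interval constraint set $G = [lo, hi]$, the quadratic utilities $f_i(\theta) = (\theta - c_i)^2/2$, the Gaussian noise, the step size $\gamma_n = 0.5/n^{0.9}$ and the pairwise averaging matrix drawn on a complete graph.

```python
import random


def project_interval(x, lo, hi):
    """Euclidean projection of a scalar onto G = [lo, hi]."""
    return min(max(x, lo), hi)


def pairwise_matrix(n_agents, i, j):
    """Row-stochastic matrix that averages the estimates of agents i and j."""
    w = [[1.0 if r == c else 0.0 for c in range(n_agents)] for r in range(n_agents)]
    w[i][i] = w[j][j] = 0.5
    w[i][j] = w[j][i] = 0.5
    return w


def distributed_step(theta, gamma, y, w, lo, hi):
    """One iteration: local projected step (2), then gossip averaging as in (3)."""
    tmp = [project_interval(t + gamma * yi, lo, hi) for t, yi in zip(theta, y)]
    return [sum(w_ij * t_j for w_ij, t_j in zip(row, tmp)) for row in w]


def run(centers, lo, hi, n_steps, seed=0):
    """Hypothetical utilities f_i(theta) = (theta - c_i)^2 / 2 observed with noise."""
    rng = random.Random(seed)
    n_agents = len(centers)
    theta = [lo] * n_agents
    for n in range(1, n_steps + 1):
        gamma = 0.5 / n ** 0.9  # assumed step size gamma_n
        # Y_{n,i}: noisy version of the opposite gradient of f_i at theta_{n-1,i}.
        y = [-(t - c) + rng.gauss(0.0, 0.1) for t, c in zip(theta, centers)]
        i, j = rng.sample(range(n_agents), 2)  # assumed random pair of agents
        theta = distributed_step(theta, gamma, y, pairwise_matrix(n_agents, i, j), lo, hi)
    return theta


estimates = run(centers=[0.0, 1.0, 5.0], lo=-1.0, hi=1.5, n_steps=5000)
```

In this toy setting the unconstrained minimizer of the sum is 2, which lies outside $G = [-1, 1.5]$, so the agents are expected to agree on a value close to the boundary point 1.5.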

B. Observation and Network Models 

Random processes $(Y_n, W_n)_{n \ge 1}$ are defined on a measurable space equipped with a probability
$P$. Notation $E$ represents the corresponding expectation. For any $n \ge 1$, we introduce the
$\sigma$-field $\mathcal{F}_n = \sigma(\theta_0, Y_{1:n}, W_{1:n})$. The distribution of the random vector $Y_{n+1}$ conditionally on $\mathcal{F}_n$
is assumed to be such that:

$$P(Y_{n+1} \in A \mid \mathcal{F}_n) = \mu_{\theta_n}(A)$$

for any measurable set $A$, where $(\mu_\theta)_{\theta \in \mathbb{R}^{dN}}$ is a given family of probability measures on $\mathbb{R}^{dN}$. For
any $\theta \in \mathbb{R}^{dN}$, define $E_\theta[g(Y)] := \int g(y)\, \mu_\theta(dy)$ for any positive function $g$ on $\mathbb{R}^{dN}$. Similarly,
we use notation $E_\theta[g(Y_i)] := \int g(y_i)\, \mu_\theta(dy_1 \times \cdots \times dy_N)$ for any positive function $g$ on $\mathbb{R}^d$ and
for any $i = 1, \dots, N$. Denote by $|x|$ the Euclidean norm of a vector $x$.

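Assumption 1. The following conditions hold:

a) $(W_n)_{n \ge 1}$ is a sequence of matrix-valued random variables such that:

• $W_n$ is row stochastic: $W_n \mathbf{1} = \mathbf{1}$,

• $E(W_n)$ is column stochastic: $\mathbf{1}^T E(W_n) = \mathbf{1}^T$,

• The trace of $E(W_n W_n^T)$ is uniformly bounded.

b) The spectral radius $\rho_n$ of matrix $E(W_n^T (I_N - \mathbf{1}\mathbf{1}^T/N) W_n)$ satisfies:

$$\lim_{n \to \infty} n(1 - \rho_n) = +\infty.$$ (4)

c) For any positive measurable functions $g$, $h$,

$$E[g(W_{n+1})\, h(Y_{n+1}) \mid \mathcal{F}_n] = E[g(W_{n+1})]\; E_{\theta_n}[h(Y)].$$

d) For any $i = 1, \dots, N$, $f_i$ is continuously differentiable.

e) For any $\theta = (\theta_1^T, \dots, \theta_N^T)^T$,

$$E_\theta[Y] = -(\nabla f_1(\theta_1)^T, \dots, \nabla f_N(\theta_N)^T)^T.$$

f) $\sup_{\theta \in G^N} E_\theta[|Y|^2] < \infty$.

g) $(\mu_\theta)_{\theta \in G^N}$ is tight.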

We now discuss the above Assumption. Conditions 1a) and 1b) summarize our assumptions
on matrices $W_n$, that is, on the gossip scheme used in the network. Following the seminal work
of [11], random gossip is assumed in this paper. Each matrix $W_n$ must be row stochastic: this
means that each agent $i = 1, \dots, N$ must compute a weighted average, $\sum_j W_n(i,j) = 1$. Note
that a quite classical condition in the literature is to further assume that $W_n$ is column-stochastic
for any $n$ [24], [25], [31], [9]. Column stochasticity inevitably goes with some restrictions on
the communication protocol as discussed in Section I. Here, our assumption is weaker. We only
require that $W_n$ is column stochastic on average. This is for instance the case in the natural
broadcast scheme of [3] which will be discussed in Section II-C. The condition on the trace
of $E(W_n W_n^T)$ is immediately satisfied if coefficients $W_n(i,j)$ are non-negative. It is also satisfied
if the sequence of matrices $(W_n)_{n \ge 1}$ is identically distributed. Assumption 1b) expresses a certain
connectivity condition of the underlying network graph which will be discussed in more detail
at the end of this paragraph and in Section II-C.

Assumptions 1c-e) are related to the observation model. Assumption 1c) implies that the
random variables $W_{n+1}$ and $Y_{n+1}$ are independent conditionally on the past. In addition, $(W_n)_{n \ge 1}$
forms an independent sequence (not necessarily identically distributed). Assumption 1e) means
that each $Y_{n,i}$ can be interpreted as a noisy version of $-\nabla f_i(\theta_{n-1,i})$. The distribution of the
random additive perturbation $Y_{n,i} - (-\nabla f_i(\theta_{n-1,i}))$ is likely to depend on the past through the
value of $\theta_{n-1}$, but has a zero mean for any given value of $\theta_{n-1}$.

Assumption 2. a) The deterministic sequence $(\gamma_n)_{n \ge 1}$ is positive and such that $\sum_n \gamma_n = \infty$.

b) There exists $\alpha > 1/2$ such that:

$$\lim_{n \to \infty} n^\alpha \gamma_n = 0$$ (5)

$$\liminf_{n \to \infty} \frac{1 - \rho_n}{n^\alpha \gamma_n} > 0.$$ (6)

Note that, when (5) holds true then $\sum_n \gamma_n^2 < \infty$, which is a rather usual assumption in the
framework of decreasing step size stochastic algorithms [18]. In order to gain some insight
into conditions (4) and (6), first consider the case where the matrices $(W_n)_{n \ge 1}$ form an i.i.d. sequence i.e., the
spectral radius $\rho := \rho_n$ does not depend on $n$. Then both conditions (4) and (6) are satisfied if
and only if:

$$\rho < 1.$$ (7)

Nevertheless, matrices $(W_n)_{n \ge 1}$ do not need to be i.i.d. An interesting example is when matrix
$W_n$ is likely to be equal to identity with a probability that tends to one as $n \to \infty$. From
a communication point of view, this means that the exchange of information between agents
becomes rare as $n \to \infty$. This context is especially interesting in case of wireless networks,
where it is often required to limit as much as possible the communication overhead.

Consider for instance the case where $1 - \rho_n = a/n^\eta$ and $\gamma_n = \gamma_0/n^\xi$ for some constants
$a, \gamma_0 > 0$. Then, a sufficient condition for Assumption 2 is:

$$0 \le \eta < \xi - 1/2 \le 1/2.$$

In particular, $\xi \in (1/2, 1]$ and $\eta \in [0, 1/2)$.
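The condition above can be checked mechanically. The sketch below assumes the polynomial rates $1 - \rho_n = a/n^\eta$ and $\gamma_n = \gamma_0/n^\xi$ of the example; the function names are hypothetical. With these rates, (5) requires $\alpha < \xi$ and (6) requires $\alpha \le \xi - \eta$, which is what the second function exploits.

```python
def rates_are_admissible(xi, eta):
    """Check the sufficient condition 0 <= eta < xi - 1/2 <= 1/2."""
    return 0.0 <= eta < xi - 0.5 <= 0.5


def admissible_alpha(xi, eta):
    """Return one exponent alpha > 1/2 satisfying (5) and (6), or None."""
    if not rates_are_admissible(xi, eta):
        return None
    # (5): n^alpha * gamma_n -> 0 needs alpha < xi.
    # (6): (1 - rho_n) / (n^alpha * gamma_n) bounded away from 0 needs alpha <= xi - eta.
    return xi - eta if eta > 0 else (0.5 + xi) / 2.0
```

For instance, `admissible_alpha(1.0, 0.25)` picks $\alpha = 0.75$, while the pair $\xi = 0.6$, $\eta = 0.2$ is rejected because $\eta \ge \xi - 1/2$.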

C. Illustration: Some Examples of Gossip schemes 

Here, we focus on two standard gossip schemes and give the sequence $(W_n)_{n \ge 1}$ corresponding
to each of them. We refer the reader to [13] for a more complete picture and for more general
gossip strategies. We introduce what we shall refer to as the pairwise and the broadcast schemes.
The first one can be found in the seminal paper of Boyd et al. [11] on average consensus while
the second is inspired from the broadcast scheme depicted in [3]. The network of agents is
represented as a nondirected graph $(E, V)$ where $E$ is the set of $N$ nodes and $V$ is the set
of edges.

1) Pairwise Gossip: At time $n$, a single node $i$ wakes up (node $i$ is chosen at random, uniformly
within the set of nodes and independently from the past). Node $i$ randomly selects a node $j$
among its neighbors in the graph. Nodes $i$ and $j$ exchange their temporary estimates $\tilde\theta_{n,i}$ and
$\tilde\theta_{n,j}$ and compute the weighted average $\theta_{n,i} = \theta_{n,j} = \beta \tilde\theta_{n,i} + (1 - \beta) \tilde\theta_{n,j}$ where $0 < \beta < 1$.
Other nodes $k \notin \{i, j\}$ simply set $\theta_{n,k} = \tilde\theta_{n,k}$. Set $\beta = 1/2$ for simplicity. In this case, the
corresponding matrix $W_n$ is given by $W_n = I_N - (e_i - e_j)(e_i - e_j)^T/2$ where $e_i$ denotes the
$i$th vector of the canonical basis in $\mathbb{R}^N$. Note that $(W_n)_{n \ge 1}$ forms an i.i.d. sequence of
doubly stochastic matrices. Assumption 1a) is obviously satisfied. Moreover, the spectral radius
$\rho$ of matrix $E(W_n^T (I_N - \mathbf{1}\mathbf{1}^T/N) W_n)$ satisfies (7) if and only if $(E, V)$ is a connected graph
(see [11]).
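The pairwise matrix with $\beta = 1/2$ can be built and checked directly. The following is a small sketch in plain Python; the function name and the chosen pair of agents are hypothetical.

```python
def pairwise_gossip_matrix(n_agents, i, j):
    """W = I_N - (e_i - e_j)(e_i - e_j)^T / 2 for the communicating pair (i, j)."""
    w = [[1.0 if r == c else 0.0 for c in range(n_agents)] for r in range(n_agents)]
    # (e_i - e_j)(e_i - e_j)^T has +1 at (i,i), (j,j) and -1 at (i,j), (j,i).
    for r, c, sign in ((i, i, 1), (j, j, 1), (i, j, -1), (j, i, -1)):
        w[r][c] -= 0.5 * sign
    return w


W = pairwise_gossip_matrix(4, 0, 2)
row_sums = [sum(row) for row in W]
col_sums = [sum(col) for col in zip(*W)]
```

Both `row_sums` and `col_sums` contain only ones, which is the double stochasticity noted in the text.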

2) Broadcast Gossip: At time $n$, a random node $i$ wakes up and broadcasts its temporary
update to all its neighbors. Any neighbor $j$ computes the weighted average $\theta_{n,j} = \beta \tilde\theta_{n,i} + (1 - \beta) \tilde\theta_{n,j}$.
On the other hand, any node $k$ which does not belong to the neighborhood of $i$ (this includes
$i$ itself) simply sets $\theta_{n,k} = \tilde\theta_{n,k}$. Note that, as opposed to the pairwise scheme, the transmitter
node $i$ does not expect any feedback from its neighbors. It is straightforward to show that the
$(k, \ell)$th component of matrix $W_n$ corresponding to such a scheme writes:

$$W_n(k, \ell) = \begin{cases} 1 & \text{if } k \notin \mathcal{N}_i \text{ and } k = \ell \\ \beta & \text{if } k \in \mathcal{N}_i \text{ and } \ell = i \\ 1 - \beta & \text{if } k \in \mathcal{N}_i \text{ and } k = \ell \\ 0 & \text{otherwise,} \end{cases}$$

where $\mathcal{N}_i$ denotes the set of neighbors of node $i$.

As a matter of fact, the above matrix $W_n$ is not doubly stochastic since $\mathbf{1}^T W_n \neq \mathbf{1}^T$. Nevertheless,
it is straightforward to check that $\mathbf{1}^T E(W_n) = \mathbf{1}^T$ (see for instance [3]). Thus, the sequence of
matrices $(W_n)_{n \ge 1}$ satisfies Assumption 1a). Once again, straightforward derivations which
can be found in [3] show that the spectral radius $\rho$ satisfies (7) if and only if $(E, V)$ is a connected
graph.
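The two claims above (no column stochasticity for a single draw, column stochasticity on average) can be verified on a small example. The sketch below assumes a hypothetical undirected path graph with three nodes, $\beta = 1/2$, and a broadcasting node drawn uniformly at random.

```python
def broadcast_gossip_matrix(neighbors, i, beta):
    """Matrix W_n when node i broadcasts: each neighbor k mixes in node i's value."""
    n_agents = len(neighbors)
    w = [[1.0 if r == c else 0.0 for c in range(n_agents)] for r in range(n_agents)]
    for k in neighbors[i]:
        w[k][k] = 1.0 - beta
        w[k][i] = beta
    return w


def expected_matrix(neighbors, beta):
    """E(W_n) when the broadcasting node is uniform over the nodes (assumption)."""
    n_agents = len(neighbors)
    mean = [[0.0] * n_agents for _ in range(n_agents)]
    for i in range(n_agents):
        w = broadcast_gossip_matrix(neighbors, i, beta)
        for r in range(n_agents):
            for c in range(n_agents):
                mean[r][c] += w[r][c] / n_agents
    return mean


# Hypothetical undirected path graph 0 - 1 - 2.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
W = broadcast_gossip_matrix(neighbors, 1, beta=0.5)
EW = expected_matrix(neighbors, beta=0.5)
row_sums_W = [sum(row) for row in W]
col_sums_W = [sum(col) for col in zip(*W)]
col_sums_EW = [sum(col) for col in zip(*EW)]
```

Here `col_sums_W` differs from the all-ones vector, whereas `col_sums_EW` equals it up to rounding, in line with Assumption 1a).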

III. Main Result: Convergence w.p.1 

We study the case where, for any $i = 1, \dots, N$, the set $G$ is determined by a set of $p$ inequality
constraints ($p \ge 1$):

$$G := \{\theta \in \mathbb{R}^d : \forall j = 1, \dots, p,\ q_j(\theta) \le 0\}$$ (8)

for some functions $q_1, \dots, q_p$ which satisfy the following conditions. For any $\theta \in \mathbb{R}^d$, we denote
by $A(\theta) \subset \{1, \dots, p\}$ the active set i.e., $q_j(\theta) = 0$ if $j \in A(\theta)$ and $q_j(\theta) < 0$ otherwise. Denote
by $\partial G$ the boundary of $G$.

Assumption 3. a) The set $G$ defined by (8) is nonempty and compact.

b) For any $j = 1, \dots, p$, $q_j : \mathbb{R}^d \to \mathbb{R}$ is a convex function, continuously differentiable in a
neighborhood of $\partial G$.

c) For any $\theta \in \partial G$, $\{\nabla q_j(\theta) : j \in A(\theta)\}$ is a linearly independent collection of vectors.



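For any vector $\theta \in \mathbb{R}^{dN}$, we note

$$\langle\theta\rangle := \frac{1}{N}(\mathbf{1}^T \otimes I_d)\,\theta.$$ (9)

Equation (9) simply means that $\langle\theta\rangle = (\theta_1 + \cdots + \theta_N)/N$ in case we write $\theta = (\theta_1^T, \dots, \theta_N^T)^T$
for some $\theta_1, \dots, \theta_N$ in $\mathbb{R}^d$. Denote by:

$$f := \frac{1}{N} \sum_{i=1}^{N} f_i$$

the average of utility functions. Define the set of KKT points of $f$ on $G$ (also called the set of
stationary points) as:

$$\mathcal{L} := \{\theta \in G : -\nabla f(\theta) \in \mathcal{N}_G(\theta)\},$$

where $\mathcal{N}_G(\theta)$ is the normal cone i.e., $\mathcal{N}_G(\theta) := \{v \in \mathbb{R}^d : \forall \theta' \in G,\ v^T(\theta - \theta') \ge 0\}$. Define
$\mathbf{1} \otimes \mathcal{L} := \{\mathbf{1} \otimes \theta : \theta \in \mathcal{L}\}$. Define $d(\theta, A) := \inf\{|\theta - a| : a \in A\}$ for any $\theta \in \mathbb{R}^{dN}$ and any
set $A$.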
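For a closed convex set $G$, a point $\theta \in G$ satisfies $-\nabla f(\theta) \in \mathcal{N}_G(\theta)$ if and only if $\theta = P_G[\theta - \gamma \nabla f(\theta)]$ for some (equivalently, any) $\gamma > 0$. The sketch below uses this standard characterization to test membership in $\mathcal{L}$ on a scalar toy problem; the interval $G = [-2, 0.5]$ and the non-convex function $f(\theta) = \theta^3/3 - \theta$ are assumptions chosen for illustration.

```python
def project_interval(x, lo, hi):
    """Euclidean projection of a scalar onto G = [lo, hi]."""
    return min(max(x, lo), hi)


def is_kkt_point(grad, theta, lo, hi, gamma=1.0, tol=1e-9):
    """Test theta = P_G[theta - gamma * grad(theta)] for theta in G = [lo, hi]."""
    if not lo <= theta <= hi:
        return False
    return abs(project_interval(theta - gamma * grad(theta), lo, hi) - theta) <= tol


def grad_f(theta):
    """Gradient of the hypothetical non-convex f(theta) = theta^3 / 3 - theta."""
    return theta * theta - 1.0


kkt_flags = {t: is_kkt_point(grad_f, t, lo=-2.0, hi=0.5) for t in (-2.0, -1.0, 0.0, 0.5)}
```

In this example the set of KKT points is $\{-2, -1, 0.5\}$: the two endpoints are local minimizers on $G$, while $-1$ is an interior critical point which is a local maximizer. This illustrates that $\mathcal{L}$ contains, but is generally larger than, the set of local minimizers.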
Theorem 1. Assume that $f(\mathcal{L})$ has an empty interior. Under Assumptions 1, 2, 3 the following
holds w.p.1:

$$\lim_{n \to \infty} d(\theta_n, \mathbf{1} \otimes \mathcal{L}) = 0.$$

Moreover, w.p.1, $(\langle\theta_n\rangle)_{n \ge 1}$ converges to a connected component of $\mathcal{L}$.

Theorem 1 establishes two points. First, a consensus is achieved as $n$ tends to infinity, meaning
that $\max_{i,j} |\theta_{n,i} - \theta_{n,j}|$ converges a.s. to zero. Second, the average estimate $\langle\theta_n\rangle$ converges to
the set $\mathcal{L}$ of KKT points. As a consequence, if $\mathcal{L}$ contains only isolated points, sequence $\langle\theta_n\rangle$
converges almost surely to one of these points.

In particular, when $f$ is convex, $\langle\theta_n\rangle$ converges to the set of global solutions to the
minimization problem (1). However, as already remarked, our result is more general and does not
rely on the convexity of $f$. If $f$ is not convex, sequence $\langle\theta_n\rangle$ does not necessarily converge to
a global solution. Nevertheless, it is well known that the KKT conditions are satisfied by any 
local minimizer fW\. 

The condition that $f(\mathcal{L})$ has an empty interior is satisfied in most practical cases. From Sard's
theorem, it holds as soon as $f$ is $d$ times continuously differentiable.

IV. Proof of Theorem 1 
A. Preliminaries: Useful Facts about Set-Valued Dynamical Systems 

Before providing the details of the proof, we recall some useful facts about perturbed
differential inclusions. All definitions and statements made in this paragraph can be found in [7].
However, for the sake of readability and completeness, it is worth recalling some facts.

Consider an arbitrary set-valued function $F$ which maps each point $\theta \in \mathbb{R}^d$ to a set $F(\theta) \subset \mathbb{R}^d$.
Assume that $F$ satisfies the following conditions:

Condition 1. The following hold:

• $F$ is a closed set-valued map i.e., $\{(\theta, y) : y \in F(\theta)\}$ is a closed subset of $\mathbb{R}^d \times \mathbb{R}^d$.

• For any $\theta \in \mathbb{R}^d$, $F(\theta)$ is a nonempty compact convex subset.

• There exists $c > 0$ such that for any $\theta \in \mathbb{R}^d$, $\sup_{z \in F(\theta)} |z| \le c(1 + |\theta|)$.

A function $x : \mathbb{R} \to \mathbb{R}^d$ is called a solution to the differential inclusion

$$\frac{dx}{dt} \in F(x)$$ (10)

if it is absolutely continuous and if $\frac{dx(t)}{dt} \in F(x(t))$ for almost every $t \in \mathbb{R}$. For any $t \in \mathbb{R}$,
$\theta \in \mathbb{R}^d$, define:

$$\Phi_t(\theta) := \{x(t) : x \text{ is a solution to (10) s.t. } x(0) = \theta\}.$$

Let $A$ be a compact set in $\mathbb{R}^d$. A continuous function $V : \mathbb{R}^d \to \mathbb{R}$ is called a Lyapunov function
for $A$ if the following two conditions hold:

$$\forall \theta \in \mathbb{R}^d \setminus A,\ \forall t > 0,\ \forall \theta' \in \Phi_t(\theta),\quad V(\theta') < V(\theta)$$

$$\forall \theta \in A,\ \forall t \ge 0,\ \forall \theta' \in \Phi_t(\theta),\quad V(\theta') \le V(\theta).$$

Finally, a function $y : [0, \infty) \to \mathbb{R}^d$ is called a perturbed solution to (10) if it is absolutely
continuous and if there exists a locally integrable function $t \mapsto U(t)$ such that:

• For any $T > 0$, $\lim_{t \to \infty} \sup_{0 \le v \le T} \left| \int_t^{t+v} U(s)\, ds \right| = 0$,

• There exists a function $\delta : [0, \infty) \to [0, \infty)$ such that $\lim_{t \to \infty} \delta(t) = 0$ and such that for
almost every $t > 0$, $\frac{dy(t)}{dt} - U(t) \in F^{\delta(t)}(y(t))$, where we define for any $\delta > 0$:

$$F^\delta(\theta) := \{z \in \mathbb{R}^d : \exists \theta' \in \mathbb{R}^d,\ |\theta - \theta'| < \delta,\ d(z, F(\theta')) < \delta\}.$$ (11)

The following result due to [7] will prove essential in our proofs. Denote by $\overline{S}$ the closure
of a set $S$.

Theorem 2 ([7]). Let $V$ be a Lyapunov function for $A$. Assume that $V(A)$ has an empty interior.
Let $y$ be a perturbed solution to (10). Then,

$$\bigcap_{t \ge 0} \overline{y([t, \infty))} \subset A.$$

Proof: The result is a consequence of Theorem 4.2, Theorem 5 and Proposition 3.27 in [7].



B. Agreement between Agents 

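Denote by $J := (\mathbf{1}\mathbf{1}^T/N) \otimes I_d$ the projector onto the consensus subspace $\{\mathbf{1} \otimes \theta : \theta \in \mathbb{R}^d\}$
and by $J^\perp := I_{dN} - J$ the projector onto the orthogonal subspace. For any vector $\theta \in \mathbb{R}^{dN}$,
remark that $\theta = \mathbf{1} \otimes \langle\theta\rangle + J^\perp \theta$.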
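The decomposition $\theta = \mathbf{1} \otimes \langle\theta\rangle + J^\perp\theta$ can be made concrete without forming the Kronecker products. In the sketch below, $\theta$ is stored as a list of $N$ vectors of $\mathbb{R}^d$; the function names and the numerical values are hypothetical.

```python
def average(theta):
    """<theta> = (theta_1 + ... + theta_N) / N for theta given as N vectors of R^d."""
    n_agents = len(theta)
    return [sum(coords) / n_agents for coords in zip(*theta)]


def consensus_part(theta):
    """J theta = 1 (x) <theta>: every agent holds the network average."""
    avg = average(theta)
    return [list(avg) for _ in theta]


def disagreement_part(theta):
    """J_perp theta = theta - J theta: deviation of each agent from the average."""
    avg = average(theta)
    return [[x - a for x, a in zip(vec, avg)] for vec in theta]


theta = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]  # N = 3 agents, d = 2
j_theta = consensus_part(theta)
j_perp_theta = disagreement_part(theta)
# Inner product between the two components, which should vanish by orthogonality.
inner = sum(a * b for u, v in zip(j_theta, j_perp_theta) for a, b in zip(u, v))
```

The disagreement component averages to zero across agents and is orthogonal to the consensus component, which is why controlling $|J^\perp\theta_n|$ is enough to establish agreement.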
Lemma 1 (Agreement). Assume that $G$ is a compact convex set. Under Assumptions 1 and 2,
$\sum_{n \ge 1} E|J^\perp \theta_n|^2 < \infty$. As a consequence, $J^\perp \theta_n$ converges to zero almost surely.



Proof: We rewrite (3) as $\theta_n = (W_n \otimes I_d)(\theta_{n-1} + \gamma_n Z_n)$ where

$$Z_n := \frac{P_{G^N}[\theta_{n-1} + \gamma_n Y_n] - \theta_{n-1}}{\gamma_n}.$$ (12)

Before going into the details of the proof of Lemma 1, it is worth noting that $|Z_n|^2 \le |Y_n|^2$
(just remark that $\theta_{n-1} = P_{G^N}[\theta_{n-1}]$ and use the fact that $G^N$ is convex). By Assumption 1f),
the sequence $(E[|Z_n|^2])_{n \ge 1}$ is therefore bounded.

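We now study $J^\perp \theta_n$. As $W_n \mathbf{1} = \mathbf{1}$, it is straightforward to show that $J^\perp (W_n \otimes I_d) =
J^\perp (W_n \otimes I_d) J^\perp$. As a consequence, $J^\perp \theta_n = J^\perp (W_n \otimes I_d)(J^\perp \theta_{n-1} + \gamma_n J^\perp Z_n)$. We expand the
square Euclidean norm of the latter vector:

$$|J^\perp \theta_n|^2 = (J^\perp \theta_{n-1} + \gamma_n J^\perp Z_n)^T \left( W_n^T (I_N - \mathbf{1}\mathbf{1}^T/N) W_n \otimes I_d \right) (J^\perp \theta_{n-1} + \gamma_n J^\perp Z_n).$$

Integrate both sides of the above equation w.r.t. the random variable $W_n$. Since $|J^\perp Z_n| \le |Z_n|$,

$$E[|J^\perp \theta_n|^2 \mid \mathcal{F}_{n-1}, Z_n] \le \rho_n\, |J^\perp \theta_{n-1} + \gamma_n Z_n|^2.$$

Expand the righthand side and take the expectation. Using that $\rho_n \le 1$ for $n$ large enough,

$$E[|J^\perp \theta_n|^2] \le \rho_n E[|J^\perp \theta_{n-1}|^2] + 2\gamma_n E[|J^\perp \theta_{n-1}|\, |Z_n|] + \gamma_n^2 E[|Z_n|^2].$$

As $E[|Z_n|^2]$ is uniformly bounded, we obtain from the Cauchy-Schwarz inequality:

$$E[|J^\perp \theta_n|^2] \le \rho_n E[|J^\perp \theta_{n-1}|^2] + \gamma_n \sqrt{C\, E[|J^\perp \theta_{n-1}|^2]} + C \gamma_n^2$$

for some constant $C > 0$.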

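Let us denote $v_n := E[|J^\perp \theta_n|^2]$. Since $\gamma_n$ still fulfills Assumption 2 when scaled by a constant
factor, it is safe to assume:

$$v_n \le \rho_n v_{n-1} + \gamma_n \sqrt{v_{n-1}} + \gamma_n^2.$$ (13)

Let $u_n := n^{2\alpha} v_n$ for some $\alpha > 1/2$ satisfying (5) and (6). Then,

$$u_n \le \left(1 + \frac{1}{n-1}\right)^{2\alpha} \rho_n u_{n-1} + n^\alpha \gamma_n \left(1 + \frac{1}{n-1}\right)^{\alpha} \sqrt{u_{n-1}} + n^{2\alpha} \gamma_n^2.$$ (14)

This implies in turn:

$$u_n - u_{n-1} \le \left(-a_n u_{n-1} + b_n \sqrt{u_{n-1}} + c_n\right) n^\alpha \gamma_n$$

where $b_n = \left(1 + \frac{1}{n-1}\right)^{\alpha}$, $a_n = \frac{1 - (1 + \frac{1}{n-1})^{2\alpha} \rho_n}{n^\alpha \gamma_n}$, and $c_n = n^\alpha \gamma_n$. A straightforward analysis of function
$\phi_n : u \mapsto -a_n u + b_n \sqrt{u} + c_n$ shows that $u \ge t_n$ implies $\phi_n(u) \le 0$ where $t_n := (b_n/a_n + c_n/b_n)^2$.
Remark that, using Assumption 1b), $a_n \sim \frac{1 - \rho_n}{n^\alpha \gamma_n}$ and using Assumption 2, $t_n$ is bounded above,
say by a constant $K > 0$. Moreover, when $u < t_n$, $\phi_n(u) \le \phi_n((b_n/2a_n)^2) = c_n + b_n^2/(4a_n)$. Notice
again that $\phi_n((b_n/2a_n)^2)$ is bounded above, say by a constant $L > 0$. We have proved that if
$u_{n-1} \le K$ then $u_n \le K + L$ and if $u_{n-1} > K$, $u_n \le u_{n-1}$. This implies that $u_n \le \max(K + L, u_0)$.
Since $v_n = u_n / n^{2\alpha}$ with $2\alpha > 1$, we conclude that $\sum_n v_n < \infty$. ■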

Lemma 1 proves that agents asymptotically reach an agreement on their estimate. Another
way to state Lemma 1 is to write that $\max_{i,j = 1 \cdots N} |\theta_{n,i} - \theta_{n,j}|$ converges a.s. to zero as $n$ tends
to infinity. Therefore, the asymptotic analysis of the whole vector $\theta_n$ now reduces to the study
of the average $\langle\theta_n\rangle = N^{-1} \sum_{i=1}^{N} \theta_{n,i}$.

C. Expression of the Average Estimate 

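We introduce the following notation for any $\gamma > 0$, $\theta \in G^N$:

$$g_\gamma(\theta) := E_\theta\left[ \frac{\langle P_{G^N}[\theta + \gamma Y] - \theta - \gamma Y \rangle}{\gamma} \right].$$

Proposition 1. Under Assumptions 1, 2, 3 there exist two stochastic processes $(\xi_n)_{n \ge 1}$, $(r_n)_{n \ge 1}$
such that for each $n \ge 1$:

$$\langle\theta_n\rangle = \langle\theta_{n-1}\rangle - \gamma_n \nabla f(\langle\theta_{n-1}\rangle) + \gamma_n g_{\gamma_n}(\theta_{n-1}) + \gamma_n \xi_n + \gamma_n r_n$$ (15)

and satisfying w.p.1:

$$\lim_{n \to \infty} \sup_{k \ge n} \left| \sum_{i=n}^{k} \gamma_i \xi_i \right| = 0, \qquad \lim_{n \to \infty} r_n = 0.$$ (16)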

Note that the third term in the righthand side of (15) is zero whenever $\theta_{n-1} + \gamma_n Y_n$ lies
in $G^N$ i.e., when the projector is inoperative. In order to have some insights, assume just for a
moment that this holds for any $n$ after a certain rank. In this case, equation (15) simply becomes

$$\langle\theta_n\rangle = \langle\theta_{n-1}\rangle - \gamma_n \nabla f(\langle\theta_{n-1}\rangle) + \gamma_n \xi_n + \gamma_n r_n.$$ (17)

In this case, by the continuity of $\nabla f$ and using the above conditions on the sequences $\xi_n$ and $r_n$,
the asymptotic behavior of sequence $\langle\theta_n\rangle$ can be directly characterized using classical stochastic
approximation results [18], [HI, iH, [14]. Indeed, a sequence $\langle\theta_n\rangle$ satisfying (17) converges to
the set of critical points of $f$. Nevertheless, the projector $P_{G^N}$ is generally active in practice,
so that the term $g_{\gamma_n}(\theta_{n-1})$ may be nonzero infinitely often. This additional term raises at least
two problems. First, it depends on the whole vector $\theta_n$ and not only on the average $\langle\theta_n\rangle$:
equation (15) thus looks nothing like a usual iteration of a stochastic approximation algorithm.
Second, $g_\gamma(\theta)$ is not a continuous function of $\theta$, whereas standard approaches often assume the
continuity of the mean field of the stochastic approximation algorithm.

D. Set-Valued Function and Inclusions 

Define $h := \sup_{\theta \in G^N} E_\theta|Y|$. Define the following set-valued function $F$ on $\mathbb{R}^d$ which maps
any $\theta$ to the set:

$$F(\theta) := \{-\nabla f(\theta) - z : z \in \mathcal{N}_G(\theta),\ |z| \le 3h\}.$$ (18)

Using that $f$ is continuously differentiable and that $G$ is closed and convex, it can be shown that
$F$ satisfies Condition 1¹. Recall notation $F^\delta(\theta)$ in (11). Consider stochastic processes $(\xi_n, r_n)_{n \ge 1}$
as in Proposition 1.
as in Proposition [U 

Proposition 2. Under Assumptions 1, 2, 3 there exists a sequence of random variables $(\delta_n)_{n \ge 1}$
converging a.s. to zero and an integer $n_0$ such that for any $n \ge n_0$,

$$\frac{\langle\theta_n\rangle - \langle\theta_{n-1}\rangle}{\gamma_n} - \xi_n - r_n \in F^{\delta_n}(\langle\theta_{n-1}\rangle).$$

Proof: Recall that:

$$g_\gamma(\theta) = \frac{1}{\gamma N} \sum_{i=1}^{N} E_\theta\left( P_G[\theta_i + \gamma Y_i] - \theta_i - \gamma Y_i \right).$$

From the triangle inequality, it is straightforward to show that $|g_\gamma(\theta)| \le 2h$. Remark that for any
$\theta \in G$ and any $y \in \mathbb{R}^d$, the vector $\theta + \gamma y - P_G[\theta + \gamma y]$ belongs to the normal cone $\mathcal{N}_G(P_G[\theta + \gamma y])$
at point $P_G[\theta + \gamma y]$. Otherwise stated, $\theta + \gamma y - P_G[\theta + \gamma y]$ can be written as a linear combination
of the gradient vectors associated with the active constraints, where the coefficients of the linear
combination are nonnegative. The latter linear combination is moreover unique due to the
qualification constraint given by Assumption 3c). More precisely, if $A(P_G[\theta + \gamma y])$ represents
the active set at point $P_G[\theta + \gamma y]$ for any $\theta$, $\gamma$, $y$, there exists a unique collection of nonnegative
coefficients $\{\lambda_j(\theta, \gamma, y) : j \in A(P_G[\theta + \gamma y])\}$ such that:

$$P_G[\theta + \gamma y] - \theta - \gamma y = -\gamma \sum_{j \in A(P_G[\theta + \gamma y])} \lambda_j(\theta, \gamma, y)\, \nabla q_j(P_G[\theta + \gamma y]).$$ (19)

Throughout the paper, we use the convention that $\lambda_j(\theta, \gamma, y) = 0$ in case $j \notin A(P_G[\theta + \gamma y])$.
The following technical lemma is proved in Appendix B.

¹ As a purely technical point, note that the third point in Condition 1 is satisfied only if $|\nabla f(\theta)|$ increases at most at linear
speed when $|\theta| \to \infty$, which has of course no reason to be true in general. This is however unimportant, as the values of $\theta$
will always be restricted to the bounded set $G$ in the sequel. Moreover, for $\theta \notin G$, one can always redefine $F(\theta)$ as in (18) but
replacing $f$ with $\varphi \circ f$ where $\varphi$ is a slowly increasing map chosen such that Condition 1 holds for $\varphi \circ f$.

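Lemma 2. Under Assumptions 1f) and 3,

$$\sup_{i = 1 \cdots N,\ j = 1 \cdots p} E_\theta[\lambda_j(\theta_i, \gamma, Y_i)^2] < \infty.$$ (20)

We rewrite $g_\gamma(\theta)$ using expansion (19) as:

$$g_\gamma(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{p} E_\theta\left[ \lambda_j(\theta_i, \gamma, Y_i)\, \nabla q_j(P_G[\theta_i + \gamma Y_i]) \right].$$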

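The following function $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ will be useful. Define:

$$\phi(x) := \sup_{\substack{(\theta, \theta') \in G^2 :\ |\theta - \theta'| \le x \\ j = 1 \cdots p}} |\nabla q_j(\theta) - \nabla q_j(\theta')|.$$ (21)

Since each gradient $\nabla q_j$ is continuous, it is uniformly continuous on the compact set $G$. Thus
$\phi(x)$ tends to zero as $x \downarrow 0$. Loosely speaking, when $\gamma$ is small and when all $\theta_i$'s are close to
the average $\langle\theta\rangle$, the point $P_G[\theta_i + \gamma Y_i]$ is close to $\langle\theta\rangle$. In this case, the uniform continuity of
$\nabla q_j$ implies that $\nabla q_j(P_G[\theta_i + \gamma Y_i]) \simeq \nabla q_j(\langle\theta\rangle)$. Lemma 3 below states a somewhat stronger
result. For any $\epsilon > 0$ and $\theta \in G$, define $A(\theta, \epsilon)$ as the set of constraints which are active at least
for some point in an $\epsilon$-neighborhood of $\theta$:

$$A(\theta, \epsilon) := \{j = 1, \dots, p : d(\theta, q_j^{-1}(\{0\})) < \epsilon\}.$$ (22)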

Lemma 3. Under Assumption 3, there exists a constant $C > 0$ and a function $\gamma \mapsto \epsilon(\gamma)$ on $\mathbb{R}_+$
satisfying $\lim_{\gamma \downarrow 0} \epsilon(\gamma) = 0$ such that the following holds. For any $\theta \in G^N$ and any $\gamma > 0$, there
exists $(\alpha_1, \dots, \alpha_p) \in [0, C]^p$ such that

$$\Big| -g_\gamma(\theta) - \sum_{j \in A(\langle\theta\rangle,\ \epsilon(\gamma \vee |J^\perp \theta|))} \alpha_j \nabla q_j(\langle\theta\rangle) \Big| \le \epsilon(\gamma \vee |J^\perp \theta|).$$ (23)

The proof is provided in Appendix C. The sum in the lefthand side of (23) is a (nonnegative)
linear combination of the gradient vectors of the constraints at point $\langle\theta\rangle$. However, this does
not necessarily imply that this term belongs to the normal cone $\mathcal{N}_G(\langle\theta\rangle)$ because, for a fixed
$\epsilon > 0$, the set $A(\langle\theta\rangle, \epsilon)$ is in general larger than the active set $A(\langle\theta\rangle)$. Nevertheless, the
following lemma states that $A(\langle\theta\rangle, \epsilon)$ is no larger than a certain active set $A(\theta')$ for some
$\theta'$ in a neighborhood of $\langle\theta\rangle$.

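Lemma 4. Under Assumption 3, there exists a function $\epsilon \mapsto \delta(\epsilon)$ on $\mathbb{R}_+$ satisfying $\lim_{\epsilon \downarrow 0} \delta(\epsilon) = 0$
and there exists $\epsilon_0 > 0$ such that for any $0 < \epsilon < \epsilon_0$ and any $\theta \in G$, there exists $\theta' \in G$ s.t.:

$$|\theta - \theta'| \le \delta(\epsilon) \quad \text{and} \quad A(\theta, \epsilon) \subset A(\theta').$$ (24)

The proof is given in Appendix D. We put all pieces together. Consider constant $C$ and
functions $\epsilon(\cdot)$ and $\delta(\cdot)$ as in Lemma 3 and 4 respectively. Define $\epsilon_n := \epsilon(\gamma_n \vee |J^\perp \theta_{n-1}|)$ and
$\delta_n := \max(\epsilon_n + C p\, \phi(\delta(\epsilon_n)),\ \delta(\epsilon_n))$. Clearly, $\epsilon_n$ (and consequently $\delta_n$) converges to zero a.s. due
to Lemma 1 and to the fact that $\gamma_n \to 0$. In particular, there exists an integer $n_0$ s.t. $\epsilon_n < \epsilon_0$ for
any $n \ge n_0$. By Lemma 4, for any $n \ge n_0$, there exists $\theta'_n \in G$ satisfying $|\theta'_n - \langle\theta_{n-1}\rangle| \le \delta(\epsilon_n)$
and $A(\langle\theta_{n-1}\rangle, \epsilon_n) \subset A(\theta'_n)$. Thus, by Lemma 3, there exists $(\alpha_1, \dots, \alpha_p) \in [0, C]^p$, such that

$$\Big| -g_{\gamma_n}(\theta_{n-1}) - \sum_{j \in A(\theta'_n)} \alpha_j \nabla q_j(\langle\theta_{n-1}\rangle) \Big| \le \epsilon_n.$$ (25)

Define $z_n := \sum_{j \in A(\theta'_n)} \alpha_j \nabla q_j(\theta'_n)$. Clearly, $z_n \in \mathcal{N}_G(\theta'_n)$. Using inequality (25),

$$|-g_{\gamma_n}(\theta_{n-1}) - z_n| \le \epsilon_n + C \sum_{j \in A(\theta'_n)} |\nabla q_j(\theta'_n) - \nabla q_j(\langle\theta_{n-1}\rangle)| \le \epsilon_n + C p\, \phi(|\theta'_n - \langle\theta_{n-1}\rangle|) \le \delta_n.$$

By inequality $|g_\gamma(\theta)| \le 2h$, this moreover implies that $|z_n| \le 3h$ provided that $\delta_n$ is small
enough. Thus,

$$d\big(-g_{\gamma_n}(\theta_{n-1}),\ \mathcal{N}_G(\theta'_n) \cap \{z : |z| \le 3h\}\big) \le \delta_n$$

for all but a finite number of $n$'s. The proof of Proposition 2 is completed by using (18). ■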

E. Interpolated Process

Define $\tau_0 = 0$ and $\tau_n := \sum_{k=1}^n \gamma_k$ for any $n \ge 1$. Define the continuous time process $\Theta : \mathbb{R}_+ \to \mathbb{R}^d$ as:

$$\Theta(\tau_{n-1} + t) := \langle\theta_{n-1}\rangle + t\, \frac{\langle\theta_n\rangle - \langle\theta_{n-1}\rangle}{\tau_n - \tau_{n-1}}$$

for any $t \in [0, \gamma_n)$ and any $n \ge 1$.
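The interpolated process is just a piecewise-linear curve through the averages, with the step sizes acting as time increments. A minimal sketch of this construction is given below; the function name `make_interpolated_process` and the list-based data layout are assumptions chosen for the example.

```python
from bisect import bisect_right
from itertools import accumulate


def make_interpolated_process(averages, steps):
    """Return t -> Theta(t) built from a finite run.

    averages[n] holds the average estimate after n iterations (n = 0..K),
    steps[n - 1] holds the step size gamma_n (n = 1..K).
    """
    # tau_0 = 0 and tau_n = gamma_1 + ... + gamma_n
    taus = [0.0] + list(accumulate(steps))

    def theta_of(t):
        if not 0.0 <= t <= taus[-1]:
            raise ValueError("t lies outside the simulated horizon")
        # Index k = n - 1 of the interval [tau_{n-1}, tau_n) that contains t.
        k = min(bisect_right(taus, t) - 1, len(steps) - 1)
        frac = (t - taus[k]) / steps[k]
        return [a + frac * (b - a) for a, b in zip(averages[k], averages[k + 1])]

    return theta_of


theta_of = make_interpolated_process([[0.0], [1.0], [1.5]], [1.0, 0.5])
print(theta_of(0.5), theta_of(1.25))
```

Because the time increments shrink with the step sizes, later iterations are compressed on the time axis, which is what allows the curve to be compared with a continuous-time trajectory.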

Proposition 3. Under Assumptions 1, 2, 3, the interpolated process $\Theta$ is a perturbed solution
to (10) w.p.1.

Proof: The proof follows more or less the same idea as the proof of Proposition 1.3 in [7].
There exists an event $\Omega_0$ of probability one such that $\delta_n \to 0$ and $r_n \to 0$ for any sample point
$\omega \in \Omega_0$. From now on, we fix such an $\omega$ and we study function $\Theta$ for this fixed sample point.
For any $n \ge 1$ and $\tau_{n-1} < t < \tau_n$, $\frac{d\Theta(t)}{dt} = (\langle\theta_n\rangle - \langle\theta_{n-1}\rangle)/\gamma_n$. By Proposition 2,

$$\frac{d\Theta(t)}{dt} \in \xi_n + r_n + F^{\delta_n}(\langle\theta_{n-1}\rangle) \qquad (\tau_{n-1} < t < \tau_n) .$$



^0(*) . . , . , c^^, 



The following property is easy to check. For any set-valued function F, any r G W'-, 5 > 0, 

Now, for any $n$ and any $\tau_{n-1} < t < \tau_n$, define $\eta(t) := \delta_n + |r_n| + |\Theta(t) - \langle\theta_{n-1}\rangle|$ and $U(t) = \xi_n$.
We obtain:

$$\frac{d\Theta(t)}{dt} - U(t) \in F^{\eta(t)}(\Theta(t)) .$$

It is straightforward to show from (16) that for any $T > 0$, $\sup_{0 \le v \le T} \big| \int_t^{t+v} U(s)\,ds \big|$ tends to zero
as $t \to \infty$ (we refer the reader to Proposition 1.3 in [7] for details). We now prove that $\eta(t)$
tends to zero as $t \to \infty$. To this end, remark that for any $\tau_{n-1} < t < \tau_n$,

$$|\eta(t)| \le \delta_n + |r_n| + |\langle\theta_n\rangle - \langle\theta_{n-1}\rangle| \le \delta_n + (1 + \gamma_n)|r_n| + \gamma_n \sup_\theta |\nabla f(\theta)| + \gamma_n |g_{\gamma_n}(\theta_{n-1})| + \gamma_n |\xi_n| .$$

The first three terms of the righthand side of the above inequality converge to zero as $\gamma_n \to 0$,
$r_n \to 0$, $\delta_n \to 0$. The fourth term tends to zero as well because $|g_\gamma(\theta)|$ is uniformly bounded in
$(\gamma, \theta)$, as remarked in the proof of Proposition 2. Finally, $\gamma_n \xi_n$ tends to zero by (16). Thus $\eta(t)$
tends to zero as $t \to \infty$. This completes the proof of Proposition 3. ■

When $F$ is defined by (18), the differential inclusion (10) is equivalent to a differential
variational inequality [2, pp. 264]. By [2, Proposition 2, pp. 266], any solution to (10) viable in
$G$ is a solution to the projected ordinary differential equation:

$$\frac{dx(t)}{dt} = P_{T_G(x(t))}\big(-\nabla f(x(t))\big)$$

where $P_{T_G(x)}$ stands for the projection onto the tangent cone $T_G(x)$ at point $x$. Based on this
remark, it is straightforward to show that the objective function / is a Lyapunov function for 
the set of KKT points (we also refer to |[T71 . [1211 . IfTSl ). The proof of Theorem [1] then follows 
from Theorem [2l 
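For intuition about the projected differential equation, the projection onto the tangent cone has a simple closed form when $G$ is a box $\{0 \le x_i \le P_i\}$: the cone is a product of half-lines and lines, so the projection acts coordinate by coordinate and simply cancels any velocity component that points out of the box at an active bound. The sketch below illustrates this special case only; the function name and the tolerance `tol` are assumptions of the example.

```python
def project_on_tangent_cone_box(x, v, p_max, tol=1e-12):
    """Project the velocity v onto the tangent cone of the box at the point x."""
    out = []
    for xi, vi, pm in zip(x, v, p_max):
        if xi <= tol and vi < 0.0:
            out.append(0.0)   # at the lower bound: cannot move further down
        elif xi >= pm - tol and vi > 0.0:
            out.append(0.0)   # at the upper bound: cannot move further up
        else:
            out.append(vi)    # interior coordinate or inward-pointing velocity
    return out


x = [0.0, 10.0, 5.0]
p_max = [10.0, 10.0, 10.0]
print(project_on_tangent_cone_box(x, [-1.0, 2.0, 3.0], p_max))
print(project_on_tangent_cone_box(x, [1.0, -2.0, 3.0], p_max))
```

A point where the projected velocity vanishes is a stationary point of the projected dynamics, which for this box corresponds to the KKT conditions of the constrained problem.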

V. Application: Power Allocation in Ad-hoc Wireless Networks 
A. Framework 

The context of power allocation for wireless networks has recently raised a great deal of
attention in the field of distributed optimization and game theory [35], [5], [20]. Application of
distributed optimization to power allocation has been previously investigated in [29]. The present
paragraph follows the same central idea as [29] though in a rather different context.

Consider an ad hoc network composed of $N$ source-destination pairs. We focus on the so-called
interference channel. The channel gain of the $i$th user is represented by a positive coefficient
$A^{i,i}$ which can be interpreted as the square of the modulus of the corresponding complex valued
channel gain. As all agents share the same spectral band, user $i$ suffers from the multiuser
interference produced by other users $j \neq i$. We denote by $A^{j,i}$ the (positive) channel gain
between source $j$ and destination $i$. In the sequel, we assume that there is no Channel State
Information at the Transmitter (no CSIT) i.e., all channel gains are unknown at all transmitters.
However, we assume that the destination associated with the $i$th source-destination pair

• knows the set of channel gains $A^i := (A^{1,i}, \dots, A^{N,i})^T$,

• does not know the other channel gains $A^j$ for $j \neq i$.

Figure 1 below illustrates the interference channel with $N = 2$ transmit-destination pairs. Denote
by $p^i$ the power allocated by user $i$. We assume that $0 \le p^i \le \mathcal{P}_i$ where $\mathcal{P}_i$ is the maximum
allowed power for user $i$. Define $\theta = (p^1, \dots, p^N)^T$ as the vector of all powers of all users.
The aim is to select a relevant value for parameter $\theta$. We assume that destinations are able to
communicate according to an underlying connected graph. The proposed algorithm works as
follows.
1) In a first step, the set of destination nodes cooperate and jointly search for a relevant global
power allocation $\theta$. The desired vector $\theta$ corresponds to a local minimizer of an optimization
problem which will be made clear below.



[Figure 1: two source-destination pairs with no CSIT at the sources; destination 1 observes $A^{1,1}$ and $A^{2,1}$; destinations D1 and D2 gossip to jointly search for the power allocation.]

Figure 1. Example of a 2 x 2 interference channel.

2) Once an agreement is found on the power allocation vector $\theta$, each destination $i$ provides
its own source with the corresponding power $p^i$ using a dedicated channel.



B. Fixed Deterministic Channels 

First consider fixed deterministic channels. As a performance metric, consider the error
probability observed at each destination. Assuming for instance that each transmitter uses a 4-QAM
modulation, the error probability at the $i$th destination is given by [37, Section 3.1]:



Pe,.(^,A^):=g 



^i,y 



(26) 



where $\sigma_i^2$ is the variance of the additive white Gaussian noise at the $i$th destination and where
$Q(x) = (\sqrt{2\pi})^{-1} \int_x^\infty e^{-t^2/2}\,dt$. We investigate the following minimization problem:

$$\min_{\theta \in G} \sum_{i=1}^N \beta_i P_{e,i}(\theta, A^i)$$

where $\beta_i$ is an arbitrary positive deterministic weight known only by agent $i$ and where $G :=
\{(p^1, \dots, p^N) \in \mathbb{R}^N : \forall i = 1, \dots, N,\ 0 \le p^i \le \mathcal{P}_i\}$. The above optimization problem is
non-convex. Note that utility functions (26) can of course be replaced by any other continuously
differentiable functions of the signal-to-interference-plus-noise ratio without changing the results
of this section.
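The objective above is easy to evaluate numerically. The sketch below is only an illustration: it assumes that each error probability has the form $Q(\sqrt{\mathrm{SINR}_i})$ with $\mathrm{SINR}_i = A^{i,i}p^i / (\sigma_i^2 + \sum_{j \neq i} A^{j,i}p^j)$, which is one common choice and not necessarily the exact expression of (26). As noted in the text, any smooth function of the SINR could be substituted. All function names are hypothetical.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x) = P(N(0, 1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def sinr(i, powers, gains, noise_var):
    """SINR at destination i; gains[j][i] is the gain from source j to destination i."""
    interference = sum(gains[j][i] * powers[j] for j in range(len(powers)) if j != i)
    return gains[i][i] * powers[i] / (noise_var[i] + interference)


def weighted_error(powers, gains, noise_var, weights):
    """Weighted sum of error probabilities, ASSUMING P_e = Q(sqrt(SINR))."""
    return sum(
        w * q_function(math.sqrt(sinr(i, powers, gains, noise_var)))
        for i, w in enumerate(weights)
    )


gains = [[2.0, 1.0], [1.0, 2.0]]
print(weighted_error([10.0, 5.0], gains, [0.1, 0.1], [2.0 / 3.0, 1.0 / 3.0]))
```

Raising one user's power lowers its own error probability but raises the interference seen by the others, which is the source of the non-convexity.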

Section III suggests the following deterministic distributed gradient algorithm. Each user $i$ has
an estimate $\theta_{n,i}$ of the whole vector $\theta$ at the $n$th iteration. Here, we stress the fact that a given
user has not only an estimate of its own power allocation $p^i$, but also an estimate of what
should be the power allocation of other users $j \neq i$. Denote by $\theta_n = (\theta_{n,1}^T, \dots, \theta_{n,N}^T)^T$ the vector
of size $N^2$ which gathers all local estimates. Denote by $A := ((A^1)^T, \dots, (A^N)^T)^T$ the vector
which gathers all $N^2$ channel gains. The distributed algorithm writes:

$$\theta_n = (W_n \otimes I_d)\, P_{G^N}\big[\theta_{n-1} + \gamma_n \Upsilon(\theta_{n-1}; A)\big] \qquad (27)$$

where for any $\theta = (\theta_1^T, \dots, \theta_N^T)^T$ in $\mathbb{R}^{N^2}$ we set

$$\Upsilon(\theta; A) := -\big(\beta_1 \nabla_\theta P_{e,1}(\theta_1; A^1)^T, \dots, \beta_N \nabla_\theta P_{e,N}(\theta_N; A^N)^T\big)^T$$

and where $\nabla_\theta$ is the gradient operator with respect to the first argument $\theta$ of $P_{e,i}(\theta, A^i)$.
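The two-step structure of the iteration (a local projected gradient step followed by a gossip averaging step) can be sketched generically. The code below is a minimal illustration rather than the implementation used for the numerical results: the local utilities are toy quadratics instead of error probabilities, the gossip matrix is a fixed averaging matrix, and the names `project_box` and `distributed_step` are hypothetical.

```python
def project_box(x, p_max):
    """Componentwise projection onto the box 0 <= x_i <= p_max[i]."""
    return [min(max(xi, 0.0), pm) for xi, pm in zip(x, p_max)]


def distributed_step(estimates, W, gamma, local_grads, p_max):
    """One iteration: local projected descent step, then gossip with matrix W."""
    # Local step: agent i moves against the gradient of its own utility only.
    temp = [
        project_box([x - gamma * g for x, g in zip(est, grad(est))], p_max)
        for est, grad in zip(estimates, local_grads)
    ]
    # Gossip step: agent i replaces its estimate by sum_j W[i][j] * temp[j].
    n_agents, dim = len(temp), len(p_max)
    return [
        [sum(W[i][j] * temp[j][k] for j in range(n_agents)) for k in range(dim)]
        for i in range(n_agents)
    ]


# Toy usage: two agents, scalar parameter, utilities (x - 1)^2 and (x - 3)^2.
local_grads = [lambda x: [2.0 * (x[0] - 1.0)], lambda x: [2.0 * (x[0] - 3.0)]]
W = [[0.5, 0.5], [0.5, 0.5]]
estimates = [[5.0], [0.0]]
for n in range(1, 2001):
    estimates = distributed_step(estimates, W, 0.3 / n ** 0.7, local_grads, [10.0])
print(estimates)  # both agents end up close to 2.0, the minimizer of the sum
```

Each agent only ever evaluates the gradient of its own utility, yet the agents agree on a minimizer of the sum, which is the behavior the convergence analysis establishes in a far more general setting.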

C. Random Time-Varying Channels 

In many situations however, the channel gains are random and rapidly time-varying. In this
case, it is more realistic to assume that each destination $i$ observes a sequence of random channel
gains $(A_n^i)_{n \ge 1}$. The algorithm (27) has the following immediate generalization:

$$\theta_n = (W_n \otimes I_d)\, P_{G^N}\big[\theta_{n-1} + \gamma_n \Upsilon(\theta_{n-1}; A_n)\big] \qquad (28)$$

where we set $A_n := ((A_n^1)^T, \dots, (A_n^N)^T)^T$. Consider the following minimization problem:

$$\min_{\theta \in G} \sum_{i=1}^N \beta_i\, \mathbb{E}[P_{e,i}(\theta, A_n^i)] . \qquad (29)$$

Assume for simplicity that $(A_n)_{n \ge 1}$ is an i.i.d. sequence, so that the expectation in (29) does
not depend on $n$. Then, the following statement holds.

Corollary 1. Under the stated assumptions on the sequences $(\gamma_n)_{n \ge 1}$ and $(W_n)_{n \ge 1}$, the
algorithm (28) is such that sequence $(\theta_n)_{n \ge 1}$ converges to the set of KKT points of (29).

From the point of view of practical implementation, it is worth noting that the objective
function (29) is likely to be quite flat in the neighborhood of its stationary points. To avoid slow
convergence, it may be convenient to reparametrize problem (29) in a relevant way. This point
is addressed in Section VI where we use a gradient descent w.r.t. the powers in dB (simply, the
logarithm of the powers) rather than the powers themselves.
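The reparametrization in dB only changes the gradient through the chain rule: with $p = 10^{u/10}$, one has $\partial f / \partial u = (\partial f / \partial p)\, p \ln(10)/10$. A small sketch of this conversion is given below; the helper names are hypothetical, and the convention $u = 10\log_{10} p$ is the one used for the axes of the figures in Section VI.

```python
import math

LN10_OVER_10 = math.log(10.0) / 10.0


def db_to_linear(u_db):
    """Convert powers expressed in dB back to linear scale."""
    return [10.0 ** (u / 10.0) for u in u_db]


def grad_wrt_db(grad_linear, powers):
    """Chain rule: d/du f(10^(u/10)) = f'(p) * p * ln(10) / 10, componentwise."""
    return [g * p * LN10_OVER_10 for g, p in zip(grad_linear, powers)]


# Example with f(p) = p^2, whose gradient in linear scale is 2p.
p = db_to_linear([3.0])
print(grad_wrt_db([2.0 * p[0]], p))
```

The extra factor $p$ rescales each component of the gradient by the corresponding power, so the step taken in each coordinate becomes relative rather than absolute.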
VI. Numerical Results
A. Scenario #1 

As a benchmark, we first address the convex optimization scenario formulated in [22]. Define
$G \subset \mathbb{R}^2$ as the unit disk in $\mathbb{R}^2$ centered at the origin. Consider the minimization of $\sum_{i=1}^N f_i(\theta)$
w.r.t. $\theta \in G$, where for any $i = 1, \dots, N$, $f_i(\theta) := \mathbb{E}[(R_i - s_i^T\theta)^2]$. Here, $(R_1, \dots, R_N)$ is a
collection of i.i.d. real Gaussian distributed random variables with mean 0.5 and unit variance,
and $(s_1, \dots, s_N)$ is a collection of deterministic elements of $\mathbb{R}^2$. The number of agents is set
as $N = 10$ or $N = 50$. We used two different graphs: the complete graph where any agent is
connected to all other agents, and the cycle. We evaluate the performance of both pairwise and
broadcast algorithms described in Section III-C. The weighting coefficient $\beta$ used to compute the
average is set to $\beta = 0.5$. For comparison, we also evaluate the performance of the
broadcast-based algorithm of [22]. The common point between the algorithm of [22] and the broadcast
algorithm described in Section III-C is that they both rely on the same broadcast gossip scheme,
but the core of the algorithms is rather different as explained in Section III. In order to distinguish
both broadcast algorithms, we will designate the algorithm of [22] as the broadcast algorithm
with sleeping phases, referring to the fact that each agent does not update its estimates as long
as it is not the recipient of a message. On the other hand, we refer to the broadcast algorithm of
Section III-C as the broadcast algorithm without sleeping phases.
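For this scenario the local step of each agent is simple to write down. Since $\nabla f_i(\theta) = -2\,\mathbb{E}[(R_i - s_i^T\theta)\,s_i]$, a single draw of $R_i$ gives the unbiased descent direction $2(R_i - s_i^T\theta)\,s_i$, and the projection onto the unit disk is a rescaling. The sketch below illustrates one such local step; the function names are hypothetical and the gossip step is deliberately left out.

```python
import math
import random


def project_unit_disk(theta):
    """Project a point of R^2 onto the unit disk centered at the origin."""
    norm = math.hypot(theta[0], theta[1])
    return list(theta) if norm <= 1.0 else [theta[0] / norm, theta[1] / norm]


def stochastic_direction(theta, s, rng):
    """Draw R ~ N(0.5, 1) and return 2 (R - s^T theta) s, whose mean is -grad f_i(theta)."""
    r = rng.gauss(0.5, 1.0)
    residual = r - (s[0] * theta[0] + s[1] * theta[1])
    return [2.0 * residual * s[0], 2.0 * residual * s[1]]


def local_step(theta, s, gamma, rng):
    """Local projected stochastic gradient step of one agent."""
    y = stochastic_direction(theta, s, rng)
    return project_unit_disk([theta[0] + gamma * y[0], theta[1] + gamma * y[1]])


rng = random.Random(0)
print(local_step([0.9, 0.0], [1.0, 0.0], 0.5, rng))
```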

It is worth remarking that a fair comparison between different stochastic approximation
algorithms is generally a delicate task, because the behavior of each particular algorithm is
sensitive to the choice of the stepsize. In this paragraph, we simply set $\gamma_n = \gamma_0/n^{\xi}$ for all $n$,
where $\gamma_0 > 0$ and $0.5 < \xi < 1$ are parameters chosen in an ad-hoc fashion. More degrees of
freedom are of course possible when choosing $\gamma_n$, but a complete discussion would be out of the
scope of this paper. Recall that the algorithm of [22] requires a more specific choice of the
stepsize which solely depends on the initial step. We shall denote by $\gamma_0^s$ the latter initial stepsize
used with the algorithm of [22], where the superscript $s$ stands for sleeping phases.

For each algorithm, we evaluate the deviation of the estimates from the global minimizer $\theta_*$:

$$\Delta_n := \Big( \frac{1}{N} \sum_{i=1}^N |\theta_{n,i} - \theta_*|^2 \Big)^{1/2} .$$

Note that $\Delta_n$ depends on the parameters $(s_1, \dots, s_N)$. We consider 50 Monte-Carlo runs, each
of them consisting of 10000 iterations of each algorithm. For each run, we randomly select
the parameters $(s_1, \dots, s_N)$ according to the uniform distribution on the unit disk $G$. The $k$th
Monte-Carlo run yields a sequence $(\Delta_n^{(k)} : 1 \le n \le 10000)$ for each algorithm.
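The performance metric is straightforward to compute from stored estimates. The sketch below assumes the root-mean-square form of the deviation (normalization by the number of agents) and shows how the per-iteration curves of several runs would be averaged; the function names and data layout are assumptions made for the example.

```python
import math


def deviation(estimates, theta_star):
    """Root-mean-square distance between the agents' estimates and theta_star."""
    n_agents = len(estimates)
    total = sum(
        sum((x - xs) ** 2 for x, xs in zip(est, theta_star)) for est in estimates
    )
    return math.sqrt(total / n_agents)


def average_over_runs(deviations_per_run):
    """deviations_per_run[k][n] is the deviation at iteration n of run k.

    Returns the curve obtained by averaging over the runs, iteration by iteration.
    """
    n_runs = len(deviations_per_run)
    return [sum(column) / n_runs for column in zip(*deviations_per_run)]


print(deviation([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))
print(average_over_runs([[1.0, 2.0], [3.0, 4.0]]))
```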



Figure 2 represents the average deviation $(1/50) \sum_{k=1}^{50} \Delta_n^{(k)}$ as a function of the number
of iterations. In Figure 2(a), we set $N = 50$ and the graph is a cycle. In Figure 2(b), we set
$N = 10$ and the graph is a complete graph. It is worth noting that the pairwise gossip algorithm
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Figure 2. Average deviation as a function of the number of iterations (a) Cycle, N — 50, jn = O.l/n"'^, 7q = 0.1 - (b) 
Complete graph, N = 10, 7„ = l/n°-^, 7o = 5 



outperforms both broadcast based algorithms. This fact might seem surprising at first glance.
Indeed, in the framework of average consensus i.e., when the aim is not to optimize an objective
function but simply to compute an average in a distributed fashion [11], the broadcast gossip
algorithm is known to i) reach a consensus faster than the pairwise algorithm of [11] and
ii) fail to converge to the desired value. In the context of distributed optimization, a different
phenomenon happens: broadcast based optimizers do converge to the desired value, but converge
more slowly than the pairwise algorithm. The convergence has been established by Theorem 1. The
relatively slower convergence of the broadcast-based algorithm can be interpreted if one has a
closer look at the proof of Proposition 1. The process $\xi_n$ in the righthand side of equation (15)
plays the role of a random perturbation which slows down the convergence of $(\theta_n)$. Appendix A
reveals that part of this perturbation $\xi_n$ is due to the fact that $1^T W_n \neq 1^T$ (see the term $\xi_n^{(1)}$ at
equation (30)). This part of the perturbation is clearly zero when $W_n$ is doubly stochastic. This
is the case for the pairwise algorithm, but not for the broadcast algorithm.
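The difference between the two gossip schemes can be checked directly on the matrices. The sketch below builds one realization of each, assuming the standard forms of pairwise averaging and of broadcast gossip (each receiving neighbor mixes its value with the sender's value using weight `beta`); these forms and the function names are assumptions of the example, as the exact definitions belong to Section III-C. Both matrices are row-stochastic, but only the pairwise one has unit column sums.

```python
def identity(n):
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]


def pairwise_matrix(n, i, j):
    """Agents i and j replace their values by their common average."""
    W = identity(n)
    W[i][i] = W[j][j] = 0.5
    W[i][j] = W[j][i] = 0.5
    return W


def broadcast_matrix(n, sender, neighbors, beta=0.5):
    """Each neighbor k of the sender sets x_k <- beta * x_k + (1 - beta) * x_sender."""
    W = identity(n)
    for k in neighbors:
        W[k][k] = beta
        W[k][sender] = 1.0 - beta
    return W


def column_sums(W):
    return [sum(row[c] for row in W) for c in range(len(W))]


print(column_sums(pairwise_matrix(4, 0, 1)))        # all equal to one
print(column_sums(broadcast_matrix(4, 0, [1, 3])))  # not all equal to one
```

In the broadcast case the sender's column sum exceeds one, so the network average is not preserved at each step; this is the origin of the additional perturbation term discussed above.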

In conclusion, the pairwise optimizer outperforms the broadcast ones, but is also more
demanding in terms of communication abilities of the agents, as any one-way communication
from an agent to another requires a feedback link.

B. Scenario #2 

Consider the distributed power allocation algorithm of Section V-B. In order to validate the
proposed algorithm, we study the 2 x 2 interference channel shown in Figure 1. As a toy but
revealing example, first assume fixed channel gains chosen as $A^{1,1} = A^{2,2} = 2$, $A^{1,2} = A^{2,1} = 1$.
The noise variance is equal to $\sigma_1^2 = \sigma_2^2 = 0.1$. The powers $p^1$ and $p^2$ of the users must not
exceed a maximum power of $\mathcal{P}_1 = \mathcal{P}_2 = 10$. The aim is to minimize the weighted sum of
the error probabilities as in Section V-B where $\beta_1 = 2/3$, $\beta_2 = 1/3$. Strictly speaking, we actually
implement a distributed gradient descent w.r.t. the parameter vector $\theta$ in log-scale in order
to avoid slow convergence. Figure 3(a) represents the objective function of Section V-B w.r.t. $(p^1, p^2)$ in
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Figure 3. (a) Weighted sum of error probabilities for A'^ = 2 as a function of the powers p^ and p in dB - Fixed Determinitic 

channels - A^''^ = A'^-^ = 2, A'^-^ = A^''^ = 1 - /3i = 2/3, /32 = 1/3 - a? = cr| = 0.1 - Pi = P2 ^ 10. The minimum 
is achieved at point {p^,p^) — (10,5.4). (b) First agent's estimates of p^ and p^ as a function of the number of iterations - 
7„ = 200/n°'^ for n < 3000 - 7„ = 30/n°-^ for n > 3000. 



dB (the x-axis and y-axis are $10\log_{10} p^1$ and $10\log_{10} p^2$ respectively). On this example, there
exists a unique minimum achieved at point $(p^1, p^2) = (10, 5.4)$. Figure 3(b) represents, on a
single run, the trajectory of the estimates $\theta_{n,1} = (p_{n,1}, p_{n,2})$ of the first agent as a function of
the number of iterations. We compare the pairwise and the broadcast gossip schemes. Note that
we only plot the result for the broadcast scheme without sleeping phase, as we observed slow
convergence of the algorithm of [22] on this particular example. The two upper curves represent
the estimate of power $p^1$ (using a pairwise and a broadcast scheme respectively) while the two
lower curves represent the estimate of power $p^2$. Each algorithm converges to the desired value
(10, 5.4). However, the convergence curve is rather smooth in the pairwise case, and is more
erratic in the broadcast case. Indeed, matrices $W_n$ are not doubly stochastic in the broadcast
scheme. As already explained above, matrices that are not doubly stochastic introduce an artificial
noise term which is the main cause of the erratic shape of the trajectory.

We finally provide numerical results in the case where channel gains are random and
time-varying. We assume Rician fading [37, Section 2.4]. For any $n$, we set $\mathbb{E}A_n^{1,1} = \mathbb{E}A_n^{2,2} = 2$,
$\mathbb{E}A_n^{1,2} = \mathbb{E}A_n^{2,1} = 1$. The variance of each channel gain is 0.5. Components of $A_n$ are assumed
independent. Figure 4 represents the average trajectory of the estimates $\theta_{n,1} = (p_{n,1}, p_{n,2})$ of the
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Figure 4. Powers p^ and p^ as a function of the number of iterations, averaged w.r.t. 50 Monte-Carlo runs - 7„ = 200/71"'^ 
for n < 3000 - 7„ = 30/n°'^ for n > 3000. 



first agent as a function of the number of iterations. Trajectories have been averaged based on 
50 Monte-Carlo runs. Once again, we observe convergence of the distributed algorithms. The 
convergence is faster in the pairwise case. 
VII. Conclusion

We introduced a new framework for the analysis of a class of constrained optimization
algorithms for multi-agent systems. The methodology uses recent powerful results about dynamical
systems which do not rely on the convexity of the objective function, and therefore addresses
a wider range of practical distributed optimization problems. Also, the proposed framework
relaxes the common assumption of double-stochasticity of the gossip matrices, and
therefore encompasses the natural broadcast gossip scheme. The algorithm has been proved to
converge to a consensus. The interpolated process of average estimates is proved to be a perturbed
solution to a differential variational inequality, w.p.1. As a consequence, the average estimate
converges almost surely to the set of KKT points.
Appendix A
Proof of Proposition 1

From ([3]) and Assumption [TJe), it is straightforward to show that the decomposition (flSl) holds 

if one sets: 

1 ^ 



i=l 



and ^n = Cn ^ + ^n^ where: 



e^ ■■= ;^ (^^^^^^ ® /.) Pg- [^n-l + 7nl^n] (30) 



We first prove that $r_n$ tends to zero. Remark that:

$$|r_n| \le \frac{1}{N} \sum_{i=1}^N |\nabla f_i(\langle\theta_{n-1}\rangle) - \nabla f_i(\theta_{n-1,i})| .$$

Each gradient $\nabla f_i$ is continuous, and thus uniformly continuous on the compact set $G$. By
Lemma 1, $|\langle\theta_{n-1}\rangle - \theta_{n-1,i}|$ converges to zero a.s. for any $i$. Therefore, $r_n$ converges a.s. to zero.
To prove Proposition 1, it is thus sufficient to show that $\sup_{k \ge n} \big| \sum_{\ell=n}^k \gamma_\ell\, \xi_\ell^{(j)} \big|$ tends a.s. to zero
as $n \to \infty$ for $j = 1, 2$. First, consider $\xi_n^{(1)}$. Recalling that $W_n$ is row-stochastic, it follows that
$((1^T W_n - 1^T) \otimes I_d)\, J = 0$.



Thus, one may write: 



in<,n 



(1) 






Id {J^0n-1 + InZn) 



where the random vector $Z_n$ is given by (12). Define $M_n := \sum_{k=1}^n \gamma_k \xi_k^{(1)}$. It is straightforward
to show that $M_n$ is a martingale adapted to $(\mathcal{F}_n)_{n \ge 1}$. Indeed, by Assumption 1c), $W_n$ and $Z_n$
are independent conditionally to $\mathcal{F}_{n-1}$. Therefore:

where we used $1^T \mathbb{E}(W_n) = 1^T$ due to Assumption 1a). We derive:



EM„ 



fc=i 

A;=l 
n 

E^ 



N 



h] {J^Ok-i + ikZk] 



k=l 



(j^e,-. H- ,.z,f (Ml^lKlDfi^l!) « J, I „.<,,_, + ,,z,. 



^j.,^_^ ^ ^^^^,. , E|(>yji-i)(m-F)| ^ ^^ I (^,^^_^ ^ ^^^^, 



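Remark that $\mathbb{E}[(W_k^T 1 - 1)(1^T W_k - 1^T)] = \mathbb{E}[W_k^T 1 1^T W_k] - 1 1^T$. As the spectral radius of
matrix $\mathbb{E}[W_k^T 1 1^T W_k]$ is uniformly bounded, there exists a constant $C' > 0$ such that:

$$\mathbb{E}|M_n|^2 \le C' \sum_{k=1}^\infty \mathbb{E}\big[|J^\perp\theta_{k-1} + \gamma_k Z_k|^2\big] \le 2C' \sum_{k=1}^\infty \mathbb{E}|J^\perp\theta_{k-1}|^2 + 2C' \sum_{k=1}^\infty \gamma_k^2\, \mathbb{E}|Z_k|^2 .$$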

By Lemma 1, the first term in the righthand side of the above inequality is finite. Recalling
that $|Z_k| \le |Y_k|$, we deduce from Assumption 1f) that $\mathbb{E}|Z_k|^2$ is uniformly bounded. As
$\sum_k \gamma_k^2 < \infty$, we conclude that $\sup_n \mathbb{E}|M_n|^2 < \infty$. This implies that the martingale converges
a.s. to a finite random variable $M_\infty$. Thus, for any $k \ge n$,

$$\Big| \sum_{\ell=n}^k \gamma_\ell\, \xi_\ell^{(1)} \Big| = |(M_k - M_\infty) - (M_{n-1} - M_\infty)| .$$

Thus, $\sup_{k \ge n} \big| \sum_{\ell=n}^k \gamma_\ell\, \xi_\ell^{(1)} \big|$ tends a.s. to zero as $n \to \infty$.
We now study $\xi_n^{(2)}$. Clearly, $\xi_n^{(2)}$ is a martingale difference noise sequence. Therefore,



E 



j:^^^! 



(2) 



fc=l 



2" 


n r 




= i.7iE 




fc=i L 




oo 



lT 



AT 



E[mj-fc-i]) 



< J2^lE[\Y,f]<snpEe\YfJ2^l<oo 

k=l ^ k=l 



Thus, $\sup_{k \ge n} \big| \sum_{\ell=n}^k \gamma_\ell\, \xi_\ell^{(2)} \big|$ tends to zero using the same arguments. This completes the proof
of Proposition 1.



Appendix B 
Proof of Lemma [2] 

Let us define Q{9) as the matrix Q{9) := [Vgj(6')]jg^(5)). Denote by Ai(^) the smallest 
eigenvalue of Q{9)^Q{9). We first show that Ai(9) is lower semicontinuous i.e., for a sequence 
9n E G converging to 6*^, EG: 

Ai(^,) <liminfAi(a„) . (31) 

n 

Continuity of all functions Qj ensures that A(9) is upper semicontinuous, i.e., for any ^ in a 
neighborhood of 9^, A(9) C ^4(6'*). Hence, for n large enough A{9n) C ^4(6'*). Denote by Q{9n) 
the matrix d x p: 

QiOn) = [Vqj{9n)lAieM]j=i-P 

where 1^ stands for the indicator function of set A. There exists a sequence of p x 1 vectors 
Vn with unit norm such that |Q(^n)^nP = Ai(6'„) and f'„(j) = if j ^ A(9n). Since ?)„ has 
unit norm, one can extract a converging subsequence v^jyi^n) towards a unit norm p x I vector 
v^ such that |<5(6'<^(n))'S0(n.)P converges to liminf„ Ai(6'„). Using the inclusion A{9n) C A{9^) 
one has t'*(j) = when j ^ ^4(6'^,). Moreover, under Assumption Ob), functions Vqj are 
continuous, which implies that Q{9n) converges to Q{9^), hence vector v^ satisfies \Q{9*)v*\^ = 
\im.n\Q{9^{n))v,f,(n)\^- Sincc v^{i) = when j ^ A{9^), there exists a vector f* such that 
|Q(^*)t'*|' = |g(^*)C*p. Hence 



Ai(e.) < |g(^.)i^.r = IQ(^*)^*l = liminf Ai 
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This proves ( |3T| ). Under Assumption [3]^ a) G is a compact set, so lower semicontinuity of Ai{9) 
ensures that Ai reaches its minimum m > (m = would contradict Assumption Oc)). Now, 
let us denote by A := {\j{9, 7, y))J^^^g^^y) and v:= ^^{9 + -fy - Pg{9 + 71/)). One has 

A = (q{Pg{9 + ^y)fQ{PG{9 + 7^/)))"' Q {Pg{0 + iy)f v • 



Hence |A| < Ai^(Pg{9 + ^y))\Q{PG{9 + 'jy))'^v\. Continuity of Vg^ and compactness of G 
ensure the existence of L > such that: \Q{Pg{9 + 'yy))'^v\ < L\v\. To conclude, remark that 

If I < \y\ so lAI < — l-yl. Hence, 

II — ici II — m''^ ' 

Ee[A,(^„7,F,)2] < (-) Eem^] < 00 . 

Appendix C 
Proof of Lemma [3] 

Define constant A/i as the supremum in equation (|20|) : < Mi < 00 by Lemma [2l We set 

M2 = snpQ^G,j=i-p I^Q'i(^)l- Define for any x > 0: 



e{x) = ^/^+x + 2pv^0(v^ + x) + 2p^/m'iM2 ( sup E0{1\y\>i/^) ) 



1/2 



where we recall the definition (|2T]) of (p. Using the fact that 0(a;) tends to zero as x | and using 
Assumption [TJg), it is straightforward to show that e(x) tends to zero as x | 0. We decompose 

-g^{6) as -g^{6) =: s^{e) +t^{0) +u^{9) where: 

1 ^ 
^^W = ]^E^« E A,(ft.,7,>^.)l|y.|<i/V7Vgi(W) 

1 ^ 

1 ^ 

Consider first s^{6). When the indicator 'i-\Yi\<i/^ is active (equal to one), inequality |1^| < 
1/^7 holds true. In this case, 

\PG[0^ + lYi\-{O)\ < \Pa[9, + ^Yi\-9,\ + \9,-{e)\<\-fY,\ + \J^d\<^+\J^e\ 

< e(7V|J^6>|) . 
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Therefore, as soon as \Yi\ < 1/^, A{PG[Oi + 'jYi]) is included in the set A{{0),e{'y\/ \J-^9\)) 
where A is defined by (|22l) . As a consequence, 



j6^((0),e(7V|Ji0|)) 

where Oj := ^ ZliIilE0('^i(^i'7, ^i)l|y,|<i/v/7)- By Jensen's inequality, < a^ < y/M^ for 
any j. 

Consider the second term t^{6). It is straightforward to show from triangle and Cauchy- 
Schwartz's inequalities that: 

TUT ^ P 

1/2 



Finally, consider u^{6): 



N p 



\uJ0)\ < 



AT p 
i ■ 

< 



^ E E ^« (^^-(^^ ^' ^^)l|^d<i/V7 I Vg,(PG[^. + 7>^.]) - Vg,((0)))|) 



., N p 



i=l i=l 
Again, we use the fact that iPcldi + 7^i] — (^)| < ^7 "*" l"'^"'"^!- "^^ '^ i^ '^^'^ decreasing, it is 
clear that l|y^|<i/^0(|PG[6'j + 7^^] — {0)\) is no larger than (j){^/y + l-'^"'"^!)- Therefore, 

\u^{e)\ <p^,(t){^+\J^e\) <0.5e(7V|J^6/|) . 

This completes the proof of Lemma [3l 

Appendix D 
Proof of Lemma [4] 

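For any $\varepsilon \ge 0$, $j = 1, \dots, p$, define $\partial G_j^\varepsilon := \{\theta \in G : \exists\, \theta' \in q_j^{-1}(\{0\}),\ |\theta' - \theta| \le \varepsilon\}$. It is
useful to remark that $\partial G_j^0 = q_j^{-1}(\{0\}) \cap G$ is the set of points in $G$ for which the $j$th constraint
is active. In particular, $\partial G_j^0 \subset \partial G_j^\varepsilon$ for any $\varepsilon > 0$. Denote by $d_H$ the Hausdorff distance
between sets. Define: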



^ ' ^ VjeE jeE / 
The key point is to show that $\lim_{\varepsilon \downarrow 0} \delta(\varepsilon) = 0$. By contradiction, assume that this is not the case.
Then there exists a constant $c > 0$ and a sequence $\varepsilon_n \downarrow 0$ such that $\delta(\varepsilon_n) > c$ for each $n$. As
there is a finite number of subsets of $\{1, \dots, p\}$, it is straightforward to show that there exists
a certain subset $E \subset \{1, \dots, p\}$ such that for any $n \ge 1$,

$$d_H\Big( \bigcap_{j \in E} \partial G_j^{\varepsilon_n},\ \bigcap_{j \in E} \partial G_j^0 \Big) > c . \qquad (32)$$

First note that fljeE^^i" i^ nonempty. Indeed, if it was empty, Ojge^^'j would be empty as 
well, so that the Hausdorff distance in the lefthand side of (|32l ) would be dH(0,0) = < c. 
Thus, for any n > 1, there exists On E flies ^^j" ^^^^ ^^at 

^„, f|9Gn>c. (33) 

j<^E J 

The sequence {On)n>i lies in the compact set G. Thus, there exists a subsequence which converges 
to some point 6*^ G G. Without loss of generality, we shall still denote this subsequence by 
{dn)n>i in order to simplify the notations. We thus consider that lim„_^oo On = ^*- We shall now 
prove that 9^ e Oj^e'^^^j- Fo^ ^^y n > 1, 6^ E dG^f . Thus, there exists On'' E G such that 
Qj{6n ) = and |6'„ — 6n \ < e„. As Qj is convex, it is also Lipschitz on the compact set G. 
Denote by Kj its Lipschitz constant on G: 

Since Qj is continuous and e„ | 0, it follows that qj{6^^) = 0. Thus ^^ E OjeE'^^^- Therefore, 
by (l33l) . \9n — Oi,\ > c. This contradicts the fact that {9n)n>i converges to 6*^. This proves that 
(5(e) tends to zero as e I 0. 

It is useful to remark that, as a by product of the above proof, we also obtained the following 
result. Consider any set E C {1, ■ " " /p} and assume that there exists a sequence e„ | s.t. for 
any n > 1 there exists 6'„ E fljeE^^j"- ^^^ ^^ '•^^ above arguments, any limit point of such a 
sequence {9n)n>i belongs to the set flies ^^j which is thus nonempty. Let us state this result 
the other way around: for any E such that fljes ^^i ^ ^' there exists e^ > such that for 
any e < ef , fljeE^G^ = 0. We set eo = min{ef : fl^.E^GO = 0}. 

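It remains to prove (24). Let $0 < \varepsilon < \varepsilon_0$ and $\theta \in G$. Trivially,

$$\theta \in \bigcap_{j \in A(\theta, \varepsilon)} \partial G_j^\varepsilon .$$

As $\varepsilon < \varepsilon_0$, the set $\bigcap_{j \in A(\theta, \varepsilon)} \partial G_j^0$ is nonempty. There exists $\theta'$ in the latter set such that $|\theta - \theta'| \le
\delta(\varepsilon)$. By definition of $\theta'$, $q_j(\theta') = 0$ for any $j \in A(\theta, \varepsilon)$. This proves that $A(\theta, \varepsilon) \subset A(\theta')$.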
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